IDS PROJECT: Chicago Crime Data Analysis

INTRODUCTION

Crime is an inescapable part of our lives. We constantly hear about it, and some of us are affected by it directly. Increased population, technological advancements, and heightened competition for economic resources have given rise to a range of new social problems that need resolution. The protection of life and property has always been the primary goal of law enforcement officials.

We can use modern technology and data science techniques to act intelligently against this problem. Crime data analysis makes it possible for law enforcement officials to objectively determine the nature of criminal activity and to develop directed patrolling and tactical action plans to combat it effectively. At the same time, this analysis helps ensure that officials use their limited resources to their best advantage.

Police departments have accumulated a large volume of records and documentation over the years, which can serve as a valuable source of data for analysis. This analysis will benefit not only law enforcement agencies but also other fields such as real estate, where the crime rate of a particular area affects property prices.

In today's world all of us give security a high priority, so with this analysis we hope to lend a helping hand in making everyone feel safe.

GOAL: Help law enforcement officials use their limited resources in an efficient manner

QUESTIONS

  • What time of the day is the most patrolling required?
  • Which area in the city is the most dangerous?

Let's start by loading the packages we require for the project.

In [2]:
import numpy as np
import pandas as pd
import pandas_profiling
from pandas import concat  # avoid the namespace-polluting wildcard import
import os
import csv
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from scipy import stats
sns.set_style("darkgrid")
import matplotlib.image as mpimg
from IPython.display import IFrame
import warnings                   
warnings.filterwarnings("ignore")
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.model_selection import cross_validate as cv
from sklearn.model_selection import train_test_split
import folium
from folium import plugins
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
import chart_studio.plotly as py
import pylab as pl
from mpl_toolkits.mplot3d import Axes3D
from bubbly.bubbly import bubbleplot 
from plotly.graph_objs import Scatter, Figure, Layout

Data Preprocessing

In [3]:
cdata=pd.read_csv("/Users/aishwaryanambiar/Documents/IDS Project/Crimes_-_2001_to_present.csv", iterator=True, chunksize=100000)
crime_data = concat(cdata, ignore_index=True)
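Reading a ~7-million-row CSV in one call can exhaust memory; `chunksize` streams the file in pieces that are stitched together at the end. A minimal self-contained sketch of the same pattern, using a toy in-memory CSV with illustrative column names:

```python
import io
import pandas as pd

# Toy CSV standing in for the multi-GB crime file; with chunksize=2 the
# reader yields DataFrames of at most 2 rows, which concat reassembles.
csv_text = "ID,Year\n1,2002\n2,2003\n3,2004\n"
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
df = pd.concat(chunks, ignore_index=True)
print(df.shape)  # (3, 2)
```

With the real file, each chunk is parsed and discarded as soon as it is appended, so peak memory stays well below a single monolithic read.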

Printing the shape of the data

In [7]:
crime_data.shape
Out[7]:
(6975046, 30)

Checking the dataset for its different columns

In [5]:
for col in crime_data.columns: 
    print(col) 
ID
Case Number
Date
Block
IUCR
Primary Type
Description
Location Description
Arrest
Domestic
Beat
District
Ward
Community Area
FBI Code
X Coordinate
Y Coordinate
Year
Updated On
Latitude
Longitude
Location
Historical Wards 2003-2015
Zip Codes
Community Areas
Census Tracts
Wards
Boundaries - ZIP Codes
Police Districts
Police Beats

Showing the first five rows of the dataframe crime_data

In [8]:
crime_data.head()
Out[8]:
ID Case Number Date Block IUCR Primary Type Description Location Description Arrest Domestic ... Longitude Location Historical Wards 2003-2015 Zip Codes Community Areas Census Tracts Wards Boundaries - ZIP Codes Police Districts Police Beats
0 11838562 JC445596 09/21/2019 11:59:00 PM 001XX S CENTRAL AVE 0281 CRIM SEXUAL ASSAULT NON-AGGRAVATED APARTMENT False False ... -87.764774 (41.877876427, -87.764774345) 52.0 22216.0 26.0 68.0 7.0 32.0 25.0 97.0
1 11836736 JC443437 09/21/2019 11:58:00 PM 082XX S WENTWORTH AVE 143A WEAPONS VIOLATION UNLAWFUL POSS OF HANDGUN STREET True False ... -87.629142 (41.744686105, -87.629141842) 18.0 21554.0 40.0 1.0 13.0 59.0 20.0 236.0
2 11836857 JC443451 09/21/2019 11:55:00 PM 021XX E 70TH ST 1710 OFFENSE INVOLVING CHILDREN ENDANGERING LIFE/HEALTH CHILD APARTMENT False False ... -87.573293 (41.768010786, -87.573293045) 32.0 22538.0 39.0 152.0 33.0 24.0 18.0 262.0
3 11836791 JC443351 09/21/2019 11:53:00 PM 067XX S BISHOP ST 041A BATTERY AGGRAVATED: HANDGUN STREET False False ... -87.660365 (41.771789341, -87.660364917) 17.0 22257.0 65.0 284.0 2.0 23.0 17.0 204.0
4 11836770 JC443354 09/21/2019 11:45:00 PM 053XX S CAMPBELL AVE 1320 CRIMINAL DAMAGE TO VEHICLE RESIDENTIAL YARD (FRONT/BACK) False False ... -87.686548 (41.796133405, -87.686547819) 49.0 22248.0 61.0 778.0 8.0 56.0 23.0 120.0

5 rows × 30 columns

Checking the different crimes in the column Primary Type

In [9]:
crimes = crime_data['Primary Type'].sort_values().unique()
crimes, len(crimes)
Out[9]:
(array(['ARSON', 'ASSAULT', 'BATTERY', 'BURGLARY',
        'CONCEALED CARRY LICENSE VIOLATION', 'CRIM SEXUAL ASSAULT',
        'CRIMINAL DAMAGE', 'CRIMINAL TRESPASS', 'DECEPTIVE PRACTICE',
        'DOMESTIC VIOLENCE', 'GAMBLING', 'HOMICIDE', 'HUMAN TRAFFICKING',
        'INTERFERENCE WITH PUBLIC OFFICER', 'INTIMIDATION', 'KIDNAPPING',
        'LIQUOR LAW VIOLATION', 'MOTOR VEHICLE THEFT', 'NARCOTICS',
        'NON - CRIMINAL', 'NON-CRIMINAL',
        'NON-CRIMINAL (SUBJECT SPECIFIED)', 'OBSCENITY',
        'OFFENSE INVOLVING CHILDREN', 'OTHER NARCOTIC VIOLATION',
        'OTHER OFFENSE', 'PROSTITUTION', 'PUBLIC INDECENCY',
        'PUBLIC PEACE VIOLATION', 'RITUALISM', 'ROBBERY', 'SEX OFFENSE',
        'STALKING', 'THEFT', 'WEAPONS VIOLATION'], dtype=object), 35)

The different Ward codes can be found here: https://www.chicago.gov/city/en/about/wards.html. From this reference we know that there are 50 wards; the extra unique value below is NaN, from records with a missing ward.

In [4]:
crimes = crime_data['Ward'].sort_values().unique()
crimes, len(crimes)
Out[4]:
(array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13.,
        14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26.,
        27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38., 39.,
        40., 41., 42., 43., 44., 45., 46., 47., 48., 49., 50., nan]), 51)

IUCR stands for Illinois Uniform Crime Reporting; it encodes the nature of each crime with a specific code. The list of IUCR codes for the different crimes can be found at https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e/data.

In [5]:
crimes = crime_data['IUCR'].sort_values().unique()
crimes, len(crimes)
Out[5]:
(array(['0110', '0141', '0142', '0261', '0262', '0263', '0264', '0265',
        '0266', '0271', '0272', '0273', '0274', '0275', '0281', '0291',
        '0312', '0313', '031A', '031B', '0320', '0325', '0326', '0330',
        '0331', '0334', '0337', '033A', '033B', '0340', '041A', '041B',
        '0420', '0430', '0440', '0450', '0451', '0452', '0453', '0454',
        '0460', '0461', '0462', '0470', '0475', '0479', '0480', '0481',
        '0482', '0483', '0484', '0485', '0486', '0487', '0488', '0489',
        '0490', '0492', '0493', '0494', '0495', '0496', '0497', '0498',
        '0499', '0510', '051A', '051B', '0520', '0530', '0545', '0550',
        '0551', '0552', '0553', '0554', '0555', '0556', '0557', '0558',
        '0560', '0580', '0581', '0583', '0584', '0585', '0610', '0620',
        '0630', '0650', '0810', '0820', '0830', '0840', '0841', '0842',
        '0843', '0850', '0860', '0865', '0870', '0880', '0890', '0895',
        '0910', '0915', '0917', '0918', '0920', '0925', '0927', '0928',
        '0930', '0935', '0937', '0938', '1010', '1020', '1025', '1030',
        '1035', '1050', '1055', '1090', '1110', '1120', '1121', '1122',
        '1130', '1135', '1140', '1150', '1151', '1152', '1153', '1154',
        '1155', '1156', '1160', '1170', '1185', '1195', '1200', '1205',
        '1206', '1210', '1220', '1230', '1235', '1240', '1241', '1242',
        '1245', '1255', '1260', '1261', '1265', '1305', '1310', '1320',
        '1330', '1335', '1340', '1345', '1350', '1360', '1365', '1370',
        '1375', '141A', '141B', '141C', '142A', '142B', '1435', '143A',
        '143B', '143C', '1440', '1450', '1460', '1476', '1477', '1478',
        '1479', '1480', '1481', '1505', '1506', '1507', '1510', '1511',
        '1512', '1513', '1515', '1520', '1521', '1525', '1526', '1530',
        '1531', '1535', '1536', '1537', '1540', '1541', '1544', '1549',
        '1562', '1563', '1564', '1565', '1566', '1570', '1572', '1574',
        '1576', '1578', '1580', '1581', '1582', '1585', '1590', '1610',
        '1611', '1620', '1621', '1622', '1624', '1625', '1626', '1627',
        '1630', '1631', '1633', '1640', '1650', '1651', '1661', '1670',
        '1680', '1681', '1682', '1697', '1710', '1715', '1720', '1725',
        '1750', '1751', '1752', '1753', '1754', '1755', '1780', '1790',
        '1791', '1792', '1811', '1812', '1821', '1822', '1840', '1850',
        '1860', '1900', '2010', '2011', '2012', '2013', '2014', '2015',
        '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023',
        '2024', '2025', '2026', '2027', '2028', '2029', '2030', '2031',
        '2032', '2033', '2034', '2040', '2050', '2060', '2070', '2080',
        '2090', '2091', '2092', '2093', '2094', '2095', '2110', '2111',
        '2120', '2160', '2170', '2210', '2220', '2230', '2240', '2250',
        '2251', '2820', '2825', '2826', '2830', '2840', '2850', '2851',
        '2860', '2870', '2890', '2895', '2900', '3000', '3100', '3200',
        '3300', '3400', '3610', '3710', '3720', '3730', '3731', '3740',
        '3750', '3751', '3760', '3770', '3800', '3910', '3920', '3960',
        '3961', '3966', '3970', '3975', '3980', '4210', '4220', '4230',
        '4240', '4255', '4310', '4386', '4387', '4388', '4389', '4510',
        '4625', '4650', '4651', '4652', '4740', '4750', '4800', '4810',
        '4860', '5000', '5001', '5002', '5003', '5004', '5005', '5007',
        '5008', '5009', '500E', '500N', '5011', '5013', '501A', '501H',
        '502P', '502R', '502T', '5073', '5093', '5094', '5110', '5111',
        '5112', '5113', '5114', '5120', '5121', '5122', '5130', '5131',
        '5132', '9901'], dtype=object), 402)

The different District codes can be found here: https://home.chicagopolice.org/community/districts/.

In [10]:
crimes = crime_data['District'].sort_values().unique()
crimes, len(crimes)
Out[10]:
(array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 14.,
        15., 16., 17., 18., 19., 20., 21., 22., 24., 25., 31., nan]), 25)

From our reference we know that there are 25 districts, so the value 31 is an error. Districts 13 and 23 are not listed.

In [24]:
crimes = crime_data['Year'].sort_values().unique()
crimes, len(crimes)
Out[24]:
(array([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
        2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]), 19)

Showing the Map of Chicago divided by District

In [12]:
plt.figure(figsize=(10,18))
img = mpimg.imread('/Users/aishwaryanambiar/Documents/ABI Project/chicago_map1.png')
plt.imshow(img)
Out[12]:
<matplotlib.image.AxesImage at 0x1c2b839d68>

As the figure above shows, there are 25 districts in Chicago, yet the dataset contains a district 31, which does not exist. We conclude that it is an error, so we will drop those rows from the dataset to keep incorrect data from skewing our analysis.

In [8]:
crime_data = crime_data[crime_data['District'] != 31]
In [19]:
crimes = crime_data['District'].sort_values().unique()
crimes, len(crimes)
Out[19]:
(array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 14.,
        15., 16., 17., 18., 19., 20., 21., 22., 24., 25., nan]), 24)

Dropping the years 2001 and 2019: as we moved further into our analysis we found that most of the NULL values were in 2001, and including 2019 would distort our results because the year is not over. The analysis will therefore cover the data from 2002-2018.

In [9]:
crime_data = crime_data[crime_data['Year'] != 2001]
In [10]:
crime_data = crime_data[crime_data['Year'] != 2019]
In [27]:
crimes = crime_data['Year'].sort_values().unique()
crimes, len(crimes)
Out[27]:
(array([2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012,
        2013, 2014, 2015, 2016, 2017, 2018]), 17)
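The two year filters above can also be combined into a single mask with `isin`; a sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in for crime_data; drop 2001 and 2019 in one step.
df = pd.DataFrame({"Year": [2001, 2002, 2018, 2019]})
kept = df[~df["Year"].isin([2001, 2019])]
print(kept["Year"].tolist())  # [2002, 2018]
```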

Classifying the Crimes

For the purposes of analysis and cleaning, the crimes need to be classified into different categories. The categories I will use are as follows:

  1. Violent Crimes
  2. Property-Related Crimes
  3. Gang-related Crimes
  4. Sex Crimes
  5. Non-violent Crimes

Violent Crimes

The following are the different crimes in this category

  • Homicide
  • Assault
  • Battery
  • Criminal Sexual Assault
  • Domestic Violence
  • Human Trafficking
  • Kidnapping
  • Robbery

Selecting the different columns we need for the analysis and classifying the different crimes.

In [11]:
col2 = ['ID','Year','Date','Primary Type','Arrest','Domestic','District','Ward','IUCR','X Coordinate','Y Coordinate','Latitude','Longitude','Location','Location Description']
violent_crimes = crime_data[col2]
violent_crimes = violent_crimes[violent_crimes['Primary Type']\
                  .isin(['HOMICIDE','ASSAULT','BATTERY','CRIM SEXUAL ASSAULT', 'DOMESTIC VIOLENCE', 'HUMAN TRAFFICKING', 'KIDNAPPING', 'ROBBERY'])]

# clean some rogue (0,0) coordinates
violent_crimes = violent_crimes[violent_crimes['X Coordinate']!=0]


violent_crimes.head()
Out[11]:
ID Year Date Primary Type Arrest Domestic District Ward IUCR X Coordinate Y Coordinate Latitude Longitude Location Location Description
188572 11552724 2018 12/31/2018 11:56:00 PM BATTERY True False 12.0 25.0 0440 1168327.0 1891230.0 41.857068 -87.657625 (41.857068095, -87.657625201) OTHER
188573 11552731 2018 12/31/2018 11:55:00 PM BATTERY False False 6.0 17.0 0486 1171332.0 1852934.0 41.751914 -87.647717 (41.75191443, -87.647716532) APARTMENT
188574 11552715 2018 12/31/2018 11:49:00 PM BATTERY False False 15.0 29.0 041A 1140262.0 1897810.0 41.875684 -87.760479 (41.87568438, -87.760479356) STREET
188575 11552741 2018 12/31/2018 11:48:00 PM BATTERY False True 6.0 21.0 0486 1167710.0 1852264.0 41.750154 -87.661009 (41.750154295, -87.661008708) APARTMENT
188576 11552602 2018 12/31/2018 11:47:00 PM BATTERY False False 19.0 32.0 0460 1163636.0 1921279.0 41.939625 -87.673996 (41.939624824, -87.67399611) VEHICLE - OTHER RIDE SHARE SERVICE (E.G., UBER...

Using some visualizations to see how the different crimes are distributed throughout the city

In [101]:
g = sns.lmplot(x="X Coordinate",
               y="Y Coordinate",
               col="Primary Type",
               data=violent_crimes.dropna(), 
               col_wrap=2, size=6, fit_reg=False, 
               sharey=False,
               scatter_kws={"marker": "D",
                            "s": 10})

Exploring the data for cleaning

Below we are checking the attributes of our dataframe

In [102]:
violent_crimes.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1813434 entries, 188572 to 6489266
Data columns (total 15 columns):
ID                      int64
Year                    int64
Date                    datetime64[ns]
Primary Type            object
Arrest                  bool
Domestic                bool
District                float64
Ward                    float64
IUCR                    object
X Coordinate            float64
Y Coordinate            float64
Latitude                float64
Longitude               float64
Location                object
Location Description    object
dtypes: bool(2), datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 197.2+ MB

Checking for Unique values

In [103]:
print(violent_crimes.apply(lambda x: len(x.unique())))
ID                      1813434
Year                         17
Date                    1208218
Primary Type                  7
Arrest                        2
Domestic                      2
District                     24
Ward                         51
IUCR                         83
X Coordinate              67263
Y Coordinate             110749
Latitude                 364534
Longitude                364399
Location                 364728
Location Description        176
dtype: int64

Now we will check the data for null values

In [104]:
violent_crimes.isnull().sum()
Out[104]:
ID                          0
Year                        0
Date                        0
Primary Type                0
Arrest                      0
Domestic                    0
District                   16
Ward                    39208
IUCR                        0
X Coordinate             9766
Y Coordinate             9766
Latitude                 9766
Longitude                9766
Location                 9766
Location Description       10
dtype: int64

In the data above there are many null values in the location-related attributes, which makes location-based analysis difficult. The null values in the District attribute, however, are comparatively few, so the analysis can be done on the basis of the different districts of the city of Chicago. Hence we will drop all rows with null values.

Dropping the null values

In [12]:
violent_crimes_clean = violent_crimes.dropna()
violent_crimes_clean.isnull().sum().sum()
Out[12]:
0

Analysing all the violent crimes per district

In [106]:
violent_crimes_clean = violent_crimes_clean.loc[(violent_crimes_clean['X Coordinate']!=0)]

sns.lmplot('X Coordinate',
           'Y Coordinate',
           data=violent_crimes_clean[:],
           fit_reg=False, 
           hue="District", 
           palette='Dark2',
           size=12,
           ci=2,
           scatter_kws={"marker": "D", 
                        "s": 10}) 
ax = plt.gca()
ax.set_title("All Violent Crimes (2002-2018) per District")
Out[106]:
Text(0.5, 1, 'All Violent Crimes (2002-2018) per District')

Doing some date time processing

In [35]:
violent_crimes_clean['Date'] = pd.to_datetime(violent_crimes_clean.Date) 
violent_crimes_clean['date'] = [d.date() for d in violent_crimes_clean['Date']]
violent_crimes_clean['time'] = [d.time() for d in violent_crimes_clean['Date']]

violent_crimes_clean['time'] = violent_crimes_clean['time'].astype(str)
empty_list = []
for timestr in violent_crimes_clean['time'].tolist():
    ftr = [3600,60,1]
    var = sum([a*b for a,b in zip(ftr, map(int,timestr.split(':')))])
    empty_list.append(var)
    
violent_crimes_clean['seconds'] = empty_list
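The Python loop above works, but on ~1.8 million rows the same seconds-since-midnight value can be computed directly from the datetime accessor without leaving pandas; a sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in for violent_crimes_clean['Date'].
df = pd.DataFrame({"Date": pd.to_datetime(["2018-12-31 23:56:00",
                                           "2018-01-01 00:00:30"])})
# Seconds since midnight, vectorized instead of a per-row Python loop.
dt = df["Date"].dt
df["seconds"] = dt.hour * 3600 + dt.minute * 60 + dt.second
print(df["seconds"].tolist())  # [86160, 30]
```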

Analysing all the Violent Crimes Yearly

In [36]:
violent_crimes_clean['Year'].value_counts().plot(kind='bar')
plt.title('Analysis of crime yearly')
plt.xlabel('Year')
plt.ylabel('Violent Crimes')
plt.show()

From the figure above we can conclude that the violent crimes were highest in the year 2003 and lowest in the year 2015.

Analysing the Arrests made in relation to Violent Crimes

In [37]:
violent_crimes_clean['Arrest'].value_counts().plot(kind='bar')
plt.title('Arrests')
plt.xlabel('Arrests')
plt.ylabel('Violent Crimes')
plt.show()

The figure above shows that there were very few arrests made in relation to violent crimes.

Creating subset for clustering

Here we are creating a subset of the violent crimes data set that includes the attributes 'Ward', 'IUCR' and 'District'.

In [38]:
sub_data = violent_crimes_clean[['Ward', 'IUCR', 'District']]
sub_data = sub_data.apply(lambda x:x.fillna(x.value_counts().index[0]))
sub_data['IUCR'] = sub_data.IUCR.str.extract(r'(\d+)', expand=True).astype(int)
sub_data.head()
Out[38]:
Ward IUCR District
188572 25.0 440 12.0
188573 17.0 486 6.0
188574 29.0 41 15.0
188575 21.0 486 6.0
188576 32.0 460 19.0
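Note what `.str.extract(r'(\d+)')` does to the alphanumeric IUCR codes: it keeps only the leading run of digits, so letter variants collapse together (e.g. '041A' and '041B' both become 41) and leading zeros disappear. A toy illustration:

```python
import pandas as pd

codes = pd.Series(["0440", "041A", "143B"])
# Only the leading digits survive; A/B suffixes and leading zeros are lost.
nums = codes.str.extract(r"(\d+)", expand=False).astype(int)
print(nums.tolist())  # [440, 41, 143]
```

This is a lossy but convenient way to get a numeric feature for clustering; the distinct letter-suffixed offenses are merged.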

Elbow curve

To decide how many clusters to use, we will plot an elbow curve and pick the point beyond which adding more clusters stops improving the score appreciably.

Here the data is not normalized.

In [39]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data).score(sub_data) for i in range(len(kmeans))]
score
plt.plot(N,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show() 

Here the optimal number of clusters is 5 when the data is not normalized. For K-means, however, we should normalize the data: the algorithm clusters on Euclidean distance, so without scaling the feature with the largest range dominates that distance.
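A quick numeric illustration of why scaling matters here, using toy points with roughly the ranges of the columns above (Ward 1-50, extracted IUCR about 110-9901):

```python
import numpy as np

# Two toy incidents at opposite ends of the Ward range; the Euclidean
# distance is almost entirely the IUCR difference, so Ward barely matters.
a = np.array([1.0, 110.0])    # [Ward, IUCR]
b = np.array([50.0, 9901.0])
d = float(np.linalg.norm(a - b))
print(round(d, 1))  # 9791.1 -- Ward contributes almost nothing
```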

We will still apply K-means to the unnormalized data first, to see the difference between clustering with and without normalization.

In [40]:
km = KMeans(n_clusters=5)
km.fit(sub_data)
y = km.predict(sub_data)
labels = km.labels_
sub_data['Cluster'] = y
In [41]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data['Ward'])
y = np.array(sub_data['IUCR'])
z = np.array(sub_data['District'])

ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data["Cluster"], s=60, cmap="jet")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()

Now we normalize the data so that no single feature dominates the Euclidean distances.

In [42]:
sub_data['IUCR'] = (sub_data['IUCR'] - sub_data['IUCR'].min())/(sub_data['IUCR'].max()-sub_data['IUCR'].min())
sub_data['Ward'] = (sub_data['Ward'] - sub_data['Ward'].min())/(sub_data['Ward'].max()-sub_data['Ward'].min())
sub_data['District'] = (sub_data['District'] - sub_data['District'].min())/(sub_data['District'].max()-sub_data['District'].min())
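The three lines above repeat the same min-max formula; it can be factored into a small helper (a sketch, assuming each column has max > min):

```python
import pandas as pd

def min_max(col: pd.Series) -> pd.Series:
    """Rescale a numeric column to [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

# Toy Ward column spanning the full 1-50 range.
toy = pd.DataFrame({"Ward": [1.0, 26.0, 50.0]})
print(min_max(toy["Ward"]).round(2).tolist())  # [0.0, 0.51, 1.0]
```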

Now we will again use the elbow method to determine the optimal number of clusters for the normalized data.

In [44]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data).score(sub_data) for i in range(len(kmeans))]
score
pl.plot(N,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()

del sub_data['Cluster']

From the above elbow curve, we can see the optimal number of clusters is 4.

Now we will apply KMeans.

In [45]:
km = KMeans(n_clusters=4)
km.fit(sub_data)
y = km.predict(sub_data)
labels = km.labels_
sub_data['Clusters'] = y
In [46]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data['Ward'])
y = np.array(sub_data['IUCR'])
z = np.array(sub_data['District'])

ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data["Clusters"], s=60, cmap="winter")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()

Normalizing the time to lie between 0 and 1: low values indicate midnight to early morning, medium values the afternoon, and high values the evening and night. Scaling also keeps K-means from clustering on time alone, since without it the raw seconds column would dominate the Euclidean distances.

In [47]:
violent_crimes_clean['Normalized_time'] = (violent_crimes_clean['seconds'] - violent_crimes_clean['seconds'].min())/(violent_crimes_clean['seconds'].max()-violent_crimes_clean['seconds'].min())

Now we will perform clustering using 'IUCR', 'Normalized_time' and 'District'

In [48]:
sub_data1 = violent_crimes_clean[['IUCR', 'Normalized_time', 'District']]
sub_data1['IUCR'] = sub_data1.IUCR.str.extract(r'(\d+)', expand=True).astype(int)
sub_data1['IUCR'] = (sub_data1['IUCR'] - sub_data1['IUCR'].min())/(sub_data1['IUCR'].max()-sub_data1['IUCR'].min())
sub_data1['District'] = (sub_data1['District'] - sub_data1['District'].min())/(sub_data1['District'].max()-sub_data1['District'].min())
sub_data1.head()
Out[48]:
IUCR Normalized_time District
188572 0.096828 0.997234 0.458333
188573 0.107718 0.996539 0.208333
188574 0.002367 0.992373 0.583333
188575 0.107718 0.991678 0.208333
188576 0.101562 0.990984 0.750000

Using Elbow method to determine optimal number of clusters

In [49]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data1).score(sub_data1) for i in range(len(kmeans))]
score
pl.plot(N,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()

From the elbow curve we can see that the optimal number of clusters is 4, so we will now apply K-means with 4 clusters.

In [50]:
km = KMeans(n_clusters=4)
km.fit(sub_data1)
y = km.predict(sub_data1)
labels = km.labels_
sub_data1['Clusters'] = y
sub_data1.head()
Out[50]:
IUCR Normalized_time District Clusters
188572 0.096828 0.997234 0.458333 0
188573 0.107718 0.996539 0.208333 0
188574 0.002367 0.992373 0.583333 2
188575 0.107718 0.991678 0.208333 0
188576 0.101562 0.990984 0.750000 2
In [51]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data1['Normalized_time'])
y = np.array(sub_data1['IUCR'])
z = np.array(sub_data1['District'])

ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()

All these clusters look colorful, but what do they tell us? This is one of the biggest drawbacks of the K-means clustering technique: unless you know the data really well, the clusters are hard to interpret. Hence we will apply agglomerative clustering to overcome these drawbacks and visualize the clusters as a heatmap so we can analyze the data better.

Standardizing the datetime for Agglomerative Clustering

In [13]:
from datetime import datetime
violent_crimes_clean['Date'] = pd.to_datetime(violent_crimes_clean.Date,format='%m/%d/%Y %I:%M:%S %p') 
crime_data['Date']=  pd.to_datetime(crime_data.Date,format='%m/%d/%Y %I:%M:%S %p')
In [14]:
for i in (violent_crimes_clean,crime_data):
    i['year']=i.Date.dt.year 
    i['month']=i.Date.dt.month 
    i['day']=i.Date.dt.day
    i['Hour']=i.Date.dt.hour
In [89]:
hour_by_type     = violent_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='Hour', aggfunc=np.size).fillna(0)
hour_by_district = violent_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='District', aggfunc=np.size).fillna(0)
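The pivots above count incidents per (crime type, hour) and per (crime type, district); a self-contained sketch of the same pattern on a toy incident log:

```python
import pandas as pd

# Toy incident log standing in for violent_crimes_clean.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Primary Type": ["BATTERY", "BATTERY", "ROBBERY", "BATTERY"],
    "Hour": [23, 23, 2, 2],
})
# Count non-null IDs per (type, hour); absent combinations become 0.
counts = toy.pivot_table(values="ID", index="Primary Type",
                         columns="Hour", aggfunc="count").fillna(0)
print(counts.loc["BATTERY", 23])  # 2.0
```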

Once we have created the parameters for our cluster analysis we will now apply Agglomerative Clustering.

In [90]:
from sklearn.cluster import AgglomerativeClustering as AC

def scale_df(df,axis=0):
    return (df - df.mean(axis=axis)) / df.std(axis=axis)


def plot_hmap(df, ix=None, cmap='PuRd'):
    if ix is None:
        ix = np.arange(df.shape[0])
    plt.imshow(df.iloc[ix,:], cmap=cmap)
    plt.colorbar(fraction=0.03)
    plt.yticks(np.arange(df.shape[0]), df.index[ix])
    plt.xticks(np.arange(df.shape[1]))
    plt.grid(False)
    plt.show()
    
def scale_and_plot(df, ix = None):
    df_marginal_scaled = scale_df(df.T).T
    if ix is None:
        ix = AC(4).fit(df_marginal_scaled).labels_.argsort()
    cap = np.min([np.max(df_marginal_scaled.to_numpy()), np.abs(np.min(df_marginal_scaled.to_numpy()))])
    df_marginal_scaled = np.clip(df_marginal_scaled, -1*cap, cap)
    plot_hmap(df_marginal_scaled, ix=ix)
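`scale_df(df.T).T` z-scores each row, so crime types with very different absolute volumes become comparable on one heatmap; a toy illustration of that row-wise scaling (hypothetical hour columns):

```python
import pandas as pd

def scale_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Per-row z-score: subtract the row mean, divide by the row std."""
    return df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1), axis=0)

# Two toy crime types with a 10x volume gap but the same hourly shape.
toy = pd.DataFrame({"h0": [10.0, 100.0], "h1": [20.0, 200.0]},
                   index=["ROBBERY", "BATTERY"])
scaled = scale_rows(toy)
print(scaled.round(2))  # both rows become [-0.71, 0.71]
```

After scaling, both rows look identical, which is the point: the heatmap highlights each type's relative timing pattern, not its raw count.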

In order to understand the cluster analysis better, we will visualize the clusters using a heatmap.

Hence we first cluster by hour of the day and Primary Type to determine at which hours crimes are most likely to occur.

In [92]:
CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_type)

From the figure above we can answer the first question (What time of the day is the most patrolling required?) with regard to violent crimes:

Violent crimes are most likely to occur from the late hours of the night into the early hours of the morning.

Next we cluster with Primary Type and District to determine which crime is more likely to occur in which district.

In [93]:
CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_district)

From the figure above we can answer the second question (Which area in the city is the most dangerous?) with regard to violent crimes:

The problem areas in the city are Districts 1 to 12 and 25, while Districts 13 to 24 are relatively safe.

Property Related Crimes

  1. Arson
  2. Burglary
  3. Criminal Damage
  4. Criminal Trespass
  5. Motor Vehicle Theft
  6. Theft

In [66]:
property_crimes = crime_data[col2]
property_crimes = property_crimes[property_crimes['Primary Type']\
                  .isin(['ARSON','BURGLARY','CRIMINAL DAMAGE','CRIMINAL TRESPASS', 'MOTOR VEHICLE THEFT', 'THEFT'])]

# clean some rogue (0,0) coordinates
property_crimes = property_crimes[property_crimes['X Coordinate']!=0]


property_crimes.head()
Out[66]:
ID Year Date Primary Type Arrest Domestic District Ward IUCR X Coordinate Y Coordinate Latitude Longitude Location Location Description
188570 11556487 2018 2018-12-31 23:59:00 CRIMINAL DAMAGE False False 22.0 19.0 1320 1158309.0 1829936.0 41.689079 -87.696064 (41.689078832, -87.696064026) STREET
188571 11552699 2018 2018-12-31 23:57:00 CRIMINAL DAMAGE False False 6.0 21.0 1310 1171454.0 1848783.0 41.740521 -87.647391 (41.740520866, -87.647390719) APARTMENT
188577 11554852 2018 2018-12-31 23:45:00 CRIMINAL DAMAGE False False 14.0 26.0 1310 1154587.0 1908798.0 41.905562 -87.707589 (41.905562114, -87.707588672) APARTMENT
188578 11553488 2018 2018-12-31 23:45:00 THEFT False False 19.0 44.0 0890 1169040.0 1921647.0 41.940519 -87.654124 (41.940518859, -87.6541242) BAR OR TAVERN
188579 11552570 2018 2018-12-31 23:44:00 CRIMINAL TRESPASS True False 19.0 46.0 1330 1167451.0 1931818.0 41.968463 -87.659670 (41.968462892, -87.659670442) MOVIE HOUSE/THEATER
In [117]:
p = sns.lmplot(x="X Coordinate",
               y="Y Coordinate",
               col="Primary Type",
               data=property_crimes.dropna(), 
               col_wrap=2, size=6, fit_reg=False, 
               sharey=False,
               scatter_kws={"marker": "D",
                            "s": 10})
In [118]:
property_crimes.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2889993 entries, 188570 to 6489284
Data columns (total 15 columns):
ID                      int64
Year                    int64
Date                    datetime64[ns]
Primary Type            object
Arrest                  bool
Domestic                bool
District                float64
Ward                    float64
IUCR                    object
X Coordinate            float64
Y Coordinate            float64
Latitude                float64
Longitude               float64
Location                object
Location Description    object
dtypes: bool(2), datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 314.2+ MB
In [119]:
property_crimes.isnull().sum()
Out[119]:
ID                          0
Year                        0
Date                        0
Primary Type                0
Arrest                      0
Domestic                    0
District                   16
Ward                    57684
IUCR                        0
X Coordinate            23482
Y Coordinate            23482
Latitude                23482
Longitude               23482
Location                23482
Location Description      722
dtype: int64
In [67]:
property_crimes_clean = property_crimes.dropna()
property_crimes_clean.isnull().sum().sum()
Out[67]:
0

Analysing all property related crime per district

In [121]:
property_crimes_clean = property_crimes_clean.loc[(property_crimes_clean['X Coordinate']!=0)]

sns.lmplot('X Coordinate',
           'Y Coordinate',
           data=property_crimes_clean[:],
           fit_reg=False, 
           hue="District", 
           palette='Dark2',
           size=12,
           ci=2,
           scatter_kws={"marker": "D", 
                        "s": 10}) 
ax = plt.gca()
ax.set_title("All Property Related Crimes (2002-2018) per District")
Out[121]:
Text(0.5, 1, 'All Property Related Crimes (2002-2018) per District')

Doing some date time processing

In [122]:
property_crimes_clean['Date'] = pd.to_datetime(property_crimes_clean.Date) 
property_crimes_clean['date'] = [d.date() for d in property_crimes_clean['Date']]
property_crimes_clean['time'] = [d.time() for d in property_crimes_clean['Date']]

property_crimes_clean['time'] = property_crimes_clean['time'].astype(str)
empty_list = []
for timestr in property_crimes_clean['time'].tolist():
    ftr = [3600,60,1]
    var = sum([a*b for a,b in zip(ftr, map(int,timestr.split(':')))])
    empty_list.append(var)
    
property_crimes_clean['seconds'] = empty_list
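The loop above round-trips each timestamp through a string to get seconds since midnight. Since `Date` is already a datetime column, the same quantity can be computed directly with the `.dt` accessor; a minimal sketch on a hypothetical frame standing in for `property_crimes_clean`:

```python
import pandas as pd

# Hypothetical stand-in for property_crimes_clean
df = pd.DataFrame({'Date': pd.to_datetime(['2018-12-31 23:43:00',
                                           '2018-12-31 00:05:30'])})

# Seconds since midnight, no string round-trip needed
df['seconds'] = (df['Date'].dt.hour * 3600
                 + df['Date'].dt.minute * 60
                 + df['Date'].dt.second)
# → [85380, 330]
```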

Analysing property-related crimes yearly

In [123]:
property_crimes_clean['Year'].value_counts().plot(kind='bar')
plt.title('Analysis of crime yearly')
plt.xlabel('Year')
plt.ylabel('Property Crimes')
plt.show()

From the figure above we can see that property-related crimes were the highest in 2003 and lowest in 2015.

Analysing arrests made in relation to property-related crimes

In [124]:
property_crimes_clean['Arrest'].value_counts().plot(kind='bar')
plt.title('Arrests')
plt.xlabel('Arrests')
plt.ylabel('Property Crimes')
plt.show()

From the figure above we can see that relatively few property-related crimes led to an arrest.

Creating a subset for clustering

In [125]:
sub_data_prop = property_crimes_clean[['Ward', 'IUCR', 'District']]
sub_data_prop = sub_data_prop.apply(lambda x:x.fillna(x.value_counts().index[0]))
sub_data_prop['IUCR'] = sub_data_prop.IUCR.str.extract(r'(\d+)', expand=True).astype(int)
sub_data_prop.head()
Out[125]:
Ward IUCR District
188570 19.0 1320 22.0
188571 21.0 1310 6.0
188577 26.0 1310 14.0
188578 44.0 890 19.0
188579 46.0 1330 19.0
In [126]:
from sklearn.cluster import KMeans

N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_prop).score(sub_data_prop) for i in range(len(kmeans))]
score
plt.plot(N,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show() 
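For `KMeans`, `.score(X)` is simply the negative inertia (within-cluster sum of squares), which the fitted model already exposes as `inertia_`. A minimal sketch of the same elbow computation on synthetic blobs (the toy data and cluster range are illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs as stand-in data
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(5.0, 0.3, (50, 2))])

Ks = range(1, 8)
# inertia_ is the quantity the elbow curve plots (up to sign)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in Ks]
# inertia drops sharply up to the true number of clusters (2), then flattens
```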
In [129]:
km = KMeans(n_clusters=3)
km.fit(sub_data_prop)
y = km.predict(sub_data_prop)
labels = km.labels_
sub_data_prop['Cluster'] = y
In [130]:
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib

fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_prop['Ward'])
y = np.array(sub_data_prop['IUCR'])
z = np.array(sub_data_prop['District'])

ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data_prop["Cluster"], s=60, cmap="jet")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
In [131]:
sub_data_prop['IUCR'] = (sub_data_prop['IUCR'] - sub_data_prop['IUCR'].min())/(sub_data_prop['IUCR'].max()-sub_data_prop['IUCR'].min())
sub_data_prop['Ward'] = (sub_data_prop['Ward'] - sub_data_prop['Ward'].min())/(sub_data_prop['Ward'].max()-sub_data_prop['Ward'].min())
sub_data_prop['District'] = (sub_data_prop['District'] - sub_data_prop['District'].min())/(sub_data_prop['District'].max()-sub_data_prop['District'].min())
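The three identical min-max expressions above can be factored into one helper, which also makes the later normalizations less error-prone; a small sketch (the column values are made up):

```python
import pandas as pd

def min_max_scale(s: pd.Series) -> pd.Series:
    """Scale a numeric Series to the [0, 1] range."""
    return (s - s.min()) / (s.max() - s.min())

df = pd.DataFrame({'Ward': [1.0, 25.0, 50.0]})
df['Ward'] = min_max_scale(df['Ward'])
# → 0.0, 24/49, 1.0
```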
In [132]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_prop).score(sub_data_prop) for i in range(len(kmeans))]
score
plt.plot(N,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()

del sub_data_prop['Cluster']
In [133]:
km = KMeans(n_clusters=4)
km.fit(sub_data_prop)
y = km.predict(sub_data_prop)
labels = km.labels_
sub_data_prop['Clusters'] = y
In [134]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_prop['Ward'])
y = np.array(sub_data_prop['IUCR'])
z = np.array(sub_data_prop['District'])

ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data_prop["Clusters"], s=60, cmap="winter")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
In [135]:
property_crimes_clean['Normalized_time'] = (property_crimes_clean['seconds'] - property_crimes_clean['seconds'].min())/(property_crimes_clean['seconds'].max()-property_crimes_clean['seconds'].min())
In [136]:
sub_data_prop1 = property_crimes_clean[['IUCR', 'Normalized_time', 'District']]
sub_data_prop1['IUCR'] = sub_data_prop1.IUCR.str.extract(r'(\d+)', expand=True).astype(int)
sub_data_prop1['IUCR'] = (sub_data_prop1['IUCR'] - sub_data_prop1['IUCR'].min())/(sub_data_prop1['IUCR'].max()-sub_data_prop1['IUCR'].min())
sub_data_prop1['District'] = (sub_data_prop1['District'] - sub_data_prop1['District'].min())/(sub_data_prop1['District'].max()-sub_data_prop1['District'].min())
sub_data_prop1.head()
Out[136]:
IUCR Normalized_time District
188570 0.928105 0.999317 0.875000
188571 0.915033 0.997928 0.208333
188577 0.915033 0.989595 0.541667
188578 0.366013 0.989595 0.750000
188579 0.941176 0.988900 0.750000
In [137]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_prop1).score(sub_data_prop1) for i in range(len(kmeans))]
score
plt.plot(N,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
In [138]:
km = KMeans(n_clusters=5)
km.fit(sub_data_prop1)
y = km.predict(sub_data_prop1)
labels = km.labels_
sub_data_prop1['Clusters'] = y
sub_data_prop1.head()
Out[138]:
IUCR Normalized_time District Clusters
188570 0.928105 0.999317 0.875000 1
188571 0.915033 0.997928 0.208333 1
188577 0.915033 0.989595 0.541667 1
188578 0.366013 0.989595 0.750000 3
188579 0.941176 0.988900 0.750000 1
In [139]:
#Plotting the results of 5 clusters
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_prop1['Normalized_time'])
y = np.array(sub_data_prop1['IUCR'])
z = np.array(sub_data_prop1['District'])

ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data_prop1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()

Standardizing the datetime for Agglomerative Clustering

In [68]:
from datetime import datetime
property_crimes_clean['Date'] = pd.to_datetime(property_crimes_clean.Date,format='%m/%d/%Y %I:%M:%S %p') 
crime_data['Date']=  pd.to_datetime(crime_data.Date,format='%m/%d/%Y %I:%M:%S %p')
In [69]:
for i in (property_crimes_clean,crime_data):
    i['year']=i.Date.dt.year 
    i['month']=i.Date.dt.month 
    i['day']=i.Date.dt.day
    i['Hour']=i.Date.dt.hour
In [70]:
hour_by_type     = property_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='Hour', aggfunc=np.size).fillna(0)
hour_by_district = property_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='District', aggfunc=np.size).fillna(0)
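With `aggfunc=np.size`, `pivot_table` simply counts the rows falling into each (Primary Type, hour) or (Primary Type, District) cell, and `fillna(0)` zeroes the combinations that never occur. A toy sketch with invented data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Primary Type': ['THEFT', 'THEFT', 'ARSON', 'THEFT'],
                    'Hour': [0, 0, 1, 1]})

# Count of records per (Primary Type, Hour); missing pairs become 0
counts = toy.pivot_table(values='ID', index='Primary Type',
                         columns='Hour', aggfunc=np.size).fillna(0)
# counts.loc['THEFT', 0] == 2, counts.loc['ARSON', 0] == 0
```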

Implementing Agglomerative Clustering

In [71]:
from sklearn.cluster import AgglomerativeClustering as AC

def scale_df(df,axis=0):
    return (df - df.mean(axis=axis)) / df.std(axis=axis)


def plot_hmap(df, ix=None, cmap='PuRd'):
    if ix is None:
        ix = np.arange(df.shape[0])
    plt.imshow(df.iloc[ix,:], cmap=cmap)
    plt.colorbar(fraction=0.03)
    plt.yticks(np.arange(df.shape[0]), df.index[ix])
    plt.xticks(np.arange(df.shape[1]))
    plt.grid(False)
    plt.show()
    
def scale_and_plot(df, ix = None):
    df_marginal_scaled = scale_df(df.T).T
    if ix is None:
        ix = AC(4).fit(df_marginal_scaled).labels_.argsort()
    cap = np.min([np.max(df_marginal_scaled.to_numpy()), np.abs(np.min(df_marginal_scaled.to_numpy()))])
    df_marginal_scaled = np.clip(df_marginal_scaled, -1*cap, cap)
    plot_hmap(df_marginal_scaled, ix=ix)
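Note that `scale_df` standardizes along `axis=0` by default, so `scale_and_plot` transposes first (and back) to z-score each row, i.e. each crime type, across the columns. A tiny sketch with made-up numbers showing the row-wise effect:

```python
import pandas as pd

def scale_df(df, axis=0):
    # Same helper as above: z-score along the given axis
    return (df - df.mean(axis=axis)) / df.std(axis=axis)

# Toy 2x3 frame standing in for hour_by_type
toy = pd.DataFrame([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
row_scaled = scale_df(toy.T).T  # each row now has mean 0 and std 1
# both rows become [-1.0, 0.0, 1.0]
```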

Visualizing the clusters using heatmaps

We first cluster by hour of the day and Primary Type to determine at which hours crimes are most likely to occur.

In [64]:
CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_type)

From the figure above we can answer the first question (What time of the day is the most patrolling required?) in regards to property-related crimes:

The time of day when property-related crimes are most likely to occur is from midday to the early hours of the morning, around 1 am.

Next we cluster with Primary Type and District to determine which crime is more likely to occur in which district.

In [72]:
CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_district)

From the figure above we can answer the second question (Which area in the city is the most dangerous?) in regards to property-related crimes:

The problem areas in the city are Districts 1 to 12 and 25, while Districts 13 to 24 are relatively safe.

Gang-Related Crimes

  1. Homicide
  2. Concealed Carry License Violation
  3. Narcotics
  4. Weapons Violation
In [40]:
gang_crimes = crime_data[col2]
gang_crimes = gang_crimes[gang_crimes['Primary Type']\
                  .isin(['HOMICIDE','CONCEALED CARRY LICENSE VIOLATION','NARCOTICS','WEAPONS VIOLATION'])]

# clean some rogue (0,0) coordinates
gang_crimes = gang_crimes[gang_crimes['X Coordinate']!=0]


gang_crimes.head()
Out[40]:
ID Year Date Primary Type Arrest Domestic District Ward IUCR X Coordinate Y Coordinate Latitude Longitude Location Location Description
188580 11552603 2018 2018-12-31 23:43:00 WEAPONS VIOLATION True False 7.0 6.0 143A 1176589.0 1857611.0 41.764632 -87.628312 (41.764632089, -87.628311641) GAS STATION
188581 11552568 2018 2018-12-31 23:42:00 NARCOTICS True False 25.0 30.0 2022 1136522.0 1919634.0 41.935640 -87.773689 (41.935639786, -87.773688687) ALLEY
188582 11552739 2018 2018-12-31 23:40:00 WEAPONS VIOLATION True False 6.0 8.0 1477 1183113.0 1846339.0 41.733551 -87.604749 (41.733551299, -87.604749489) STREET
188588 11552637 2018 2018-12-31 23:30:00 NARCOTICS True False 6.0 21.0 1822 1171139.0 1848199.0 41.738925 -87.648562 (41.738925174, -87.648561871) STREET
188593 11552601 2018 2018-12-31 23:26:00 WEAPONS VIOLATION True False 22.0 34.0 1477 1171769.0 1837673.0 41.710027 -87.646561 (41.710026548, -87.646561348) RESIDENTIAL YARD (FRONT/BACK)
In [143]:
h = sns.lmplot(x="X Coordinate",
               y="Y Coordinate",
               col="Primary Type",
               data=gang_crimes.dropna(), 
               col_wrap=2, size=6, fit_reg=False, 
               sharey=False,
               scatter_kws={"marker": "D",
                            "s": 10})
In [144]:
gang_crimes.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 738872 entries, 188580 to 6489041
Data columns (total 15 columns):
ID                      738872 non-null int64
Year                    738872 non-null int64
Date                    738872 non-null datetime64[ns]
Primary Type            738872 non-null object
Arrest                  738872 non-null bool
Domestic                738872 non-null bool
District                738869 non-null float64
Ward                    720545 non-null float64
IUCR                    738872 non-null object
X Coordinate            728763 non-null float64
Y Coordinate            728763 non-null float64
Latitude                728763 non-null float64
Longitude               728763 non-null float64
Location                728763 non-null object
Location Description    738870 non-null object
dtypes: bool(2), datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 80.3+ MB
In [145]:
gang_crimes.isnull().sum()
Out[145]:
ID                          0
Year                        0
Date                        0
Primary Type                0
Arrest                      0
Domestic                    0
District                    3
Ward                    18327
IUCR                        0
X Coordinate            10109
Y Coordinate            10109
Latitude                10109
Longitude               10109
Location                10109
Location Description        2
dtype: int64
In [41]:
gang_crimes_clean = gang_crimes.dropna()
gang_crimes_clean.isnull().sum().sum()
Out[41]:
0

Analysing all gang-related crimes per district

In [147]:
gang_crimes_clean = gang_crimes_clean.loc[(gang_crimes_clean['X Coordinate']!=0)]

sns.lmplot(x='X Coordinate',
           y='Y Coordinate',
           data=gang_crimes_clean[:],
           fit_reg=False, 
           hue="District", 
           palette='Dark2',
           size=12,
           ci=2,
           scatter_kws={"marker": "D", 
                        "s": 10}) 
ax = plt.gca()
ax.set_title("All Gang Related Crimes (2001-present) per District")
Out[147]:
Text(0.5, 1, 'All Gang Related Crimes (2001-present) per District')

Doing some date-time processing

In [148]:
gang_crimes_clean['Date'] = pd.to_datetime(gang_crimes_clean.Date) 
gang_crimes_clean['date'] = [d.date() for d in gang_crimes_clean['Date']]
gang_crimes_clean['time'] = [d.time() for d in gang_crimes_clean['Date']]

gang_crimes_clean['time'] = gang_crimes_clean['time'].astype(str)
empty_list = []
for timestr in gang_crimes_clean['time'].tolist():
    ftr = [3600,60,1]
    var = sum([a*b for a,b in zip(ftr, map(int,timestr.split(':')))])
    empty_list.append(var)
    
gang_crimes_clean['seconds'] = empty_list

Analysis of gang-related crimes yearly

In [149]:
gang_crimes_clean['Year'].value_counts().plot(kind='bar')
plt.title('Analysis of crime yearly')
plt.xlabel('Year')
plt.ylabel('Gang Crimes')
plt.show()

From the figure above we can conclude that gang-related crimes were the highest in 2004 and lowest in 2017.

Analysis of arrests in relation to gang-related crimes

In [150]:
gang_crimes_clean['Arrest'].value_counts().plot(kind='bar')
plt.title('Arrests')
plt.xlabel('Arrests')
plt.ylabel('Gang Crimes')
plt.show()

The figure above shows that a comparatively high proportion of gang-related crimes led to an arrest.

Implementing K-means clustering

In [151]:
sub_data_gang = gang_crimes_clean[['Ward', 'IUCR', 'District']]
sub_data_gang = sub_data_gang.apply(lambda x:x.fillna(x.value_counts().index[0]))
sub_data_gang['IUCR'] = sub_data_gang.IUCR.str.extract(r'(\d+)', expand=True).astype(int)
sub_data_gang.head()
Out[151]:
Ward IUCR District
188580 6.0 143 7.0
188581 30.0 2022 25.0
188582 8.0 1477 6.0
188588 21.0 1822 6.0
188593 34.0 1477 22.0
In [152]:
sub_data_gang['IUCR'] = (sub_data_gang['IUCR'] - sub_data_gang['IUCR'].min())/(sub_data_gang['IUCR'].max()-sub_data_gang['IUCR'].min())
sub_data_gang['Ward'] = (sub_data_gang['Ward'] - sub_data_gang['Ward'].min())/(sub_data_gang['Ward'].max()-sub_data_gang['Ward'].min())
sub_data_gang['District'] = (sub_data_gang['District'] - sub_data_gang['District'].min())/(sub_data_gang['District'].max()-sub_data_gang['District'].min())
In [153]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_gang).score(sub_data_gang) for i in range(len(kmeans))]
score
plt.plot(N,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
In [154]:
km = KMeans(n_clusters=4)
km.fit(sub_data_gang)
y = km.predict(sub_data_gang)
labels = km.labels_
sub_data_gang['Clusters'] = y
In [155]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_gang['Ward'])
y = np.array(sub_data_gang['IUCR'])
z = np.array(sub_data_gang['District'])

ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data_gang["Clusters"], s=60, cmap="winter")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
In [156]:
gang_crimes_clean['Normalized_time'] = (gang_crimes_clean['seconds'] - gang_crimes_clean['seconds'].min())/(gang_crimes_clean['seconds'].max()-gang_crimes_clean['seconds'].min())
In [157]:
sub_data_gang1 = gang_crimes_clean[['IUCR', 'Normalized_time', 'District']]
sub_data_gang1['IUCR'] = sub_data_gang1.IUCR.str.extract(r'(\d+)', expand=True).astype(int)
sub_data_gang1['IUCR'] = (sub_data_gang1['IUCR'] - sub_data_gang1['IUCR'].min())/(sub_data_gang1['IUCR'].max()-sub_data_gang1['IUCR'].min())
sub_data_gang1['District'] = (sub_data_gang1['District'] - sub_data_gang1['District'].min())/(sub_data_gang1['District'].max()-sub_data_gang1['District'].min())
sub_data_gang1.head()
Out[157]:
IUCR Normalized_time District
188580 0.011828 0.988217 0.250000
188581 0.685305 0.987523 1.000000
188582 0.489964 0.986134 0.208333
188588 0.613620 0.979189 0.208333
188593 0.489964 0.976411 0.875000
In [158]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_gang1).score(sub_data_gang1) for i in range(len(kmeans))]
score
plt.plot(N,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
In [159]:
km = KMeans(n_clusters=5)
km.fit(sub_data_gang1)
y = km.predict(sub_data_gang1)
labels = km.labels_
sub_data_gang1['Clusters'] = y
sub_data_gang1.head()
Out[159]:
IUCR Normalized_time District Clusters
188580 0.011828 0.988217 0.250000 2
188581 0.685305 0.987523 1.000000 3
188582 0.489964 0.986134 0.208333 4
188588 0.613620 0.979189 0.208333 4
188593 0.489964 0.976411 0.875000 3
In [160]:
#Plotting the results of 5 clusters
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_gang1['Normalized_time'])
y = np.array(sub_data_gang1['IUCR'])
z = np.array(sub_data_gang1['District'])

ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data_gang1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()

Standardizing datetime for Agglomerative Clustering

In [42]:
from datetime import datetime
gang_crimes_clean['Date'] = pd.to_datetime(gang_crimes_clean.Date,format='%m/%d/%Y %I:%M:%S %p') 
crime_data['Date']=  pd.to_datetime(crime_data.Date,format='%m/%d/%Y %I:%M:%S %p')
In [43]:
for i in (gang_crimes_clean,crime_data):
    i['year']=i.Date.dt.year 
    i['month']=i.Date.dt.month 
    i['day']=i.Date.dt.day
    i['Hour']=i.Date.dt.hour
In [59]:
hour_by_type     = gang_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='Hour', aggfunc=np.size).fillna(0)
hour_by_district = gang_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='District', aggfunc=np.size).fillna(0)

Implementing Agglomerative Clustering

In [60]:
from sklearn.cluster import AgglomerativeClustering as AC

def scale_df(df,axis=0):
    return (df - df.mean(axis=axis)) / df.std(axis=axis)


def plot_hmap(df, ix=None, cmap='PuRd'):
    if ix is None:
        ix = np.arange(df.shape[0])
    plt.imshow(df.iloc[ix,:], cmap=cmap)
    plt.colorbar(fraction=0.03)
    plt.yticks(np.arange(df.shape[0]), df.index[ix])
    plt.xticks(np.arange(df.shape[1]))
    plt.grid(False)
    plt.show()
    
def scale_and_plot(df, ix = None):
    df_marginal_scaled = scale_df(df.T).T
    if ix is None:
        ix = AC(4).fit(df_marginal_scaled).labels_.argsort()
    cap = np.min([np.max(df_marginal_scaled.to_numpy()), np.abs(np.min(df_marginal_scaled.to_numpy()))])
    df_marginal_scaled = np.clip(df_marginal_scaled, -1*cap, cap)
    plot_hmap(df_marginal_scaled, ix=ix)

Visualizing the clusters using heatmaps

We first cluster by hour of the day and Primary Type to determine at which hours crimes are most likely to occur.

In [27]:
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_type)

From the figure above we can answer the first question (What time of the day is the most patrolling required?) in regards to gang-related crimes:

The time of day when gang-related crimes are most likely to occur is from 6 pm until the early hours of the morning, around 2 am.

Next we cluster with Primary Type and District to determine which crime is more likely to occur in which district.

In [61]:
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_district)

From the figure above we can answer the second question (Which area in the city is the most dangerous?) in regards to gang-related crimes:

The problem areas in the city are Districts 3 to 11, 15 and 25, while Districts 1, 2, 13, 14 and 16 to 24 are relatively safe.

Sex Crimes

  1. Criminal Sexual Assault
  2. Obscenity
  3. Prostitution
  4. Public Indecency
  5. Sex Offense
  6. Offense Involving Children
In [73]:
sex_crimes = crime_data[col2]
sex_crimes = sex_crimes[sex_crimes['Primary Type']\
                  .isin(['CRIM SEXUAL ASSAULT','OBSCENITY','PROSTITUTION','PUBLIC INDECENCY', 'SEX OFFENSE', 'OFFENSE INVOLVING CHILDREN'])]

# clean some rogue (0,0) coordinates
sex_crimes = sex_crimes[sex_crimes['X Coordinate']!=0]


sex_crimes.head()
Out[73]:
ID Year Date Primary Type Arrest Domestic District Ward IUCR X Coordinate Y Coordinate Latitude Longitude Location Location Description
188646 11554560 2018 2018-12-31 22:00:00 CRIM SEXUAL ASSAULT False False 8.0 17.0 0281 1159381.0 1859440.0 41.770021 -87.691334 (41.770020864, -87.6913338) APARTMENT
188776 11583288 2018 2018-12-31 18:00:00 OFFENSE INVOLVING CHILDREN False True 16.0 38.0 1752 NaN NaN NaN NaN NaN RESIDENCE
188845 11552383 2018 2018-12-31 16:00:00 CRIM SEXUAL ASSAULT False False 4.0 7.0 0266 1194027.0 1835003.0 41.702184 -87.565137 (41.70218374, -87.565137272) RESIDENCE
188879 11552396 2018 2018-12-31 15:00:00 OFFENSE INVOLVING CHILDREN False False 6.0 17.0 1780 1175024.0 1851546.0 41.748024 -87.634228 (41.748024056, -87.634228405) APARTMENT
188903 11552218 2018 2018-12-31 14:30:00 OFFENSE INVOLVING CHILDREN False True 5.0 34.0 1780 1168218.0 1823487.0 41.671175 -87.659972 (41.671174783, -87.659971805) RESIDENCE
In [162]:
s = sns.lmplot(x="X Coordinate",
               y="Y Coordinate",
               col="Primary Type",
               data=sex_crimes.dropna(), 
               col_wrap=2, size=6, fit_reg=False, 
               sharey=False,
               scatter_kws={"marker": "D",
                            "s": 10})
In [163]:
sex_crimes.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 155587 entries, 188646 to 6489278
Data columns (total 15 columns):
ID                      155587 non-null int64
Year                    155587 non-null int64
Date                    155587 non-null datetime64[ns]
Primary Type            155587 non-null object
Arrest                  155587 non-null bool
Domestic                155587 non-null bool
District                155583 non-null float64
Ward                    152106 non-null float64
IUCR                    155587 non-null object
X Coordinate            149716 non-null float64
Y Coordinate            149716 non-null float64
Latitude                149716 non-null float64
Longitude               149716 non-null float64
Location                149716 non-null object
Location Description    155585 non-null object
dtypes: bool(2), datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 16.9+ MB
In [164]:
sex_crimes.isnull().sum()
Out[164]:
ID                         0
Year                       0
Date                       0
Primary Type               0
Arrest                     0
Domestic                   0
District                   4
Ward                    3481
IUCR                       0
X Coordinate            5871
Y Coordinate            5871
Latitude                5871
Longitude               5871
Location                5871
Location Description       2
dtype: int64
In [74]:
sex_crimes_clean = sex_crimes.dropna()
sex_crimes_clean.isnull().sum().sum()
Out[74]:
0

Analysing sex crimes per district

In [166]:
sex_crimes_clean = sex_crimes_clean.loc[(sex_crimes_clean['X Coordinate']!=0)]

sns.lmplot(x='X Coordinate',
           y='Y Coordinate',
           data=sex_crimes_clean[:],
           fit_reg=False, 
           hue="District", 
           palette='Dark2',
           size=12,
           ci=2,
           scatter_kws={"marker": "D", 
                        "s": 10}) 
ax = plt.gca()
ax.set_title("All Sex Crimes (2001-present) per District")
Out[166]:
Text(0.5, 1, 'All Sex Crimes (2001-present) per District')

Doing some date time processing

In [167]:
sex_crimes_clean['Date'] = pd.to_datetime(sex_crimes_clean.Date) 
sex_crimes_clean['date'] = [d.date() for d in sex_crimes_clean['Date']]
sex_crimes_clean['time'] = [d.time() for d in sex_crimes_clean['Date']]

sex_crimes_clean['time'] = sex_crimes_clean['time'].astype(str)
empty_list = []
for timestr in sex_crimes_clean['time'].tolist():
    ftr = [3600,60,1]
    var = sum([a*b for a,b in zip(ftr, map(int,timestr.split(':')))])
    empty_list.append(var)
    
sex_crimes_clean['seconds'] = empty_list

Analysing sex crimes yearly

In [168]:
sex_crimes_clean['Year'].value_counts().plot(kind='bar')
plt.title('Analysis of crime yearly')
plt.xlabel('Year')
plt.ylabel('Sex Crimes')
plt.show()

From the figure above we can conclude that sex crimes were the highest in 2004 and lowest in 2017.

Analysing arrests in relation to sex crimes

In [169]:
sex_crimes_clean['Arrest'].value_counts().plot(kind='bar')
plt.title('Arrests')
plt.xlabel('Arrests')
plt.ylabel('Sex Crimes')
plt.show()

The figure above shows that a comparatively high proportion of sex crimes led to an arrest.

Implementing K-means clustering

In [170]:
sub_data_sex = sex_crimes_clean[['Ward', 'IUCR', 'District']]
sub_data_sex = sub_data_sex.apply(lambda x:x.fillna(x.value_counts().index[0]))
sub_data_sex['IUCR'] = sub_data_sex.IUCR.str.extract(r'(\d+)', expand=True).astype(int)
sub_data_sex.head()
Out[170]:
Ward IUCR District
188646 17.0 281 8.0
188845 7.0 266 4.0
188879 17.0 1780 6.0
188903 34.0 1780 5.0
189027 35.0 1750 25.0
In [171]:
sub_data_sex['IUCR'] = (sub_data_sex['IUCR'] - sub_data_sex['IUCR'].min())/(sub_data_sex['IUCR'].max()-sub_data_sex['IUCR'].min())
sub_data_sex['Ward'] = (sub_data_sex['Ward'] - sub_data_sex['Ward'].min())/(sub_data_sex['Ward'].max()-sub_data_sex['Ward'].min())
sub_data_sex['District'] = (sub_data_sex['District'] - sub_data_sex['District'].min())/(sub_data_sex['District'].max()-sub_data_sex['District'].min())
In [172]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_sex).score(sub_data_sex) for i in range(len(kmeans))]
score
plt.plot(N,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
In [173]:
km = KMeans(n_clusters=3)
km.fit(sub_data_sex)
y = km.predict(sub_data_sex)
labels = km.labels_
sub_data_sex['Clusters'] = y
In [174]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_sex['Ward'])
y = np.array(sub_data_sex['IUCR'])
z = np.array(sub_data_sex['District'])

ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data_sex["Clusters"], s=60, cmap="winter")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
In [175]:
sex_crimes_clean['Normalized_time'] = (sex_crimes_clean['seconds'] - sex_crimes_clean['seconds'].min())/(sex_crimes_clean['seconds'].max()-sex_crimes_clean['seconds'].min())
In [176]:
sub_data_sex1 = sex_crimes_clean[['IUCR', 'Normalized_time', 'District']]
sub_data_sex1['IUCR'] = sub_data_sex1.IUCR.str.extract(r'(\d+)', expand=True).astype(int)
sub_data_sex1['IUCR'] = (sub_data_sex1['IUCR'] - sub_data_sex1['IUCR'].min())/(sub_data_sex1['IUCR'].max()-sub_data_sex1['IUCR'].min())
sub_data_sex1['District'] = (sub_data_sex1['District'] - sub_data_sex1['District'].min())/(sub_data_sex1['District'].max()-sub_data_sex1['District'].min())
sub_data_sex1.head()
Out[176]:
IUCR Normalized_time District
188646 0.004216 0.916805 0.291667
188845 0.001054 0.666767 0.125000
188879 0.320194 0.625094 0.208333
188903 0.320194 0.604258 0.166667
189027 0.313870 0.416729 1.000000
In [177]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_sex1).score(sub_data_sex1) for i in range(len(kmeans))]
score
plt.plot(N,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
In [178]:
km = KMeans(n_clusters=4)
km.fit(sub_data_sex1)
y = km.predict(sub_data_sex1)
labels = km.labels_
sub_data_sex1['Clusters'] = y
sub_data_sex1.head()
Out[178]:
IUCR Normalized_time District Clusters
188646 0.004216 0.916805 0.291667 3
188845 0.001054 0.666767 0.125000 3
188879 0.320194 0.625094 0.208333 3
188903 0.320194 0.604258 0.166667 3
189027 0.313870 0.416729 1.000000 2
In [179]:
#Plotting the results of 4 clusters
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_sex1['Normalized_time'])
y = np.array(sub_data_sex1['IUCR'])
z = np.array(sub_data_sex1['District'])

ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data_sex1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()

Standardizing datetime

In [75]:
from datetime import datetime
sex_crimes_clean['Date'] = pd.to_datetime(sex_crimes_clean.Date,format='%m/%d/%Y %I:%M:%S %p') 
crime_data['Date']=  pd.to_datetime(crime_data.Date,format='%m/%d/%Y %I:%M:%S %p')
In [76]:
for i in (sex_crimes_clean,crime_data):
    i['year']=i.Date.dt.year 
    i['month']=i.Date.dt.month 
    i['day']=i.Date.dt.day
    i['Hour']=i.Date.dt.hour
In [77]:
hour_by_type     = sex_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='Hour', aggfunc=np.size).fillna(0)
hour_by_district = sex_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='District', aggfunc=np.size).fillna(0)

Implementing Agglomerative clustering

In [78]:
from sklearn.cluster import AgglomerativeClustering as AC

def scale_df(df,axis=0):
    return (df - df.mean(axis=axis)) / df.std(axis=axis)


def plot_hmap(df, ix=None, cmap='PuRd'):
    if ix is None:
        ix = np.arange(df.shape[0])
    plt.imshow(df.iloc[ix,:], cmap=cmap)
    plt.colorbar(fraction=0.03)
    plt.yticks(np.arange(df.shape[0]), df.index[ix])
    plt.xticks(np.arange(df.shape[1]))
    plt.grid(False)
    plt.show()
    
def scale_and_plot(df, ix = None):
    df_marginal_scaled = scale_df(df.T).T
    if ix is None:
        ix = AC(4).fit(df_marginal_scaled).labels_.argsort()
    cap = np.min([np.max(df_marginal_scaled.to_numpy()), np.abs(np.min(df_marginal_scaled.to_numpy()))])
    df_marginal_scaled = np.clip(df_marginal_scaled, -1*cap, cap)
    plot_hmap(df_marginal_scaled, ix=ix)
In [34]:
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_type)

From the figure above we can answer the first question (What time of the day is the most patrolling required?) in regards to sex crimes:

The time of day when sex crimes are most likely to occur is from 10 pm until the early hours of the morning, around 2 am.

Next we cluster with Primary Type and District to determine which crime is more likely to occur in which district.

In [79]:
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_district)

The figure above shows that the relatively safe districts in relation to sex crimes are 1, 13, 16, 17, 20, 21 and 23.

Non-Violent Crimes

  1. Deceptive Practice
  2. Gambling
  3. Interference with Public Officer
  4. Intimidation
  5. Liquor Law Violation
  6. Other Narcotic Violation
  7. Other Offense
  8. Public Peace Violation
  9. Ritualism
  10. Stalking
In [80]:
nviolent_crimes = crime_data[col2]
nviolent_crimes = nviolent_crimes[nviolent_crimes['Primary Type']\
                  .isin(['DECEPTIVE PRACTICE','GAMBLING','INTERFERENCE WITH PUBLIC OFFICER','INTIMIDATION', 'LIQUOR LAW VIOLATION', 'OTHER NARCOTIC VIOLATION', 'OTHER OFFENSE', 'PUBLIC PEACE VIOLATION', 'RITUALISM', 'STALKING'])]

# clean some rogue (0,0) coordinates
nviolent_crimes = nviolent_crimes[nviolent_crimes['X Coordinate']!=0]


nviolent_crimes.head()
Out[80]:
ID Year Date Primary Type Arrest Domestic District Ward IUCR X Coordinate Y Coordinate Latitude Longitude Location Location Description
188569 11561837 2018 2018-12-31 23:59:00 DECEPTIVE PRACTICE False False 7.0 6.0 1153 1168573.0 1857018.0 41.763181 -87.657709 (41.763181359, -87.657709477) NaN
188585 11553486 2018 2018-12-31 23:30:00 DECEPTIVE PRACTICE False False 19.0 44.0 1150 1169040.0 1921647.0 41.940519 -87.654124 (41.940518859, -87.6541242) BAR OR TAVERN
188595 11552630 2018 2018-12-31 23:15:00 LIQUOR LAW VIOLATION True False 18.0 42.0 2250 1174647.0 1903665.0 41.891052 -87.634056 (41.891051707, -87.63405559) BAR OR TAVERN
188606 11554226 2018 2018-12-31 23:00:00 DECEPTIVE PRACTICE False False 16.0 41.0 1150 1100658.0 1934241.0 41.976290 -87.905227 (41.976290414, -87.905227221) AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA
188638 11584102 2018 2018-12-31 22:00:00 DECEPTIVE PRACTICE False False 8.0 14.0 1210 NaN NaN NaN NaN NaN OTHER
In [181]:
nv = sns.lmplot(x="X Coordinate",
               y="Y Coordinate",
               col="Primary Type",
               data=nviolent_crimes.dropna(), 
               col_wrap=2, height=6, fit_reg=False, 
               sharey=False,
               scatter_kws={"marker": "D",
                            "s": 10})
In [182]:
nviolent_crimes.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 736837 entries, 188569 to 6489285
Data columns (total 15 columns):
ID                      736837 non-null int64
Year                    736837 non-null int64
Date                    736837 non-null datetime64[ns]
Primary Type            736837 non-null object
Arrest                  736837 non-null bool
Domestic                736837 non-null bool
District                736828 non-null float64
Ward                    721830 non-null float64
IUCR                    736837 non-null object
X Coordinate            722096 non-null float64
Y Coordinate            722096 non-null float64
Latitude                722096 non-null float64
Longitude               722096 non-null float64
Location                722096 non-null object
Location Description    732663 non-null object
dtypes: bool(2), datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 80.1+ MB
In [183]:
nviolent_crimes.isnull().sum()
Out[183]:
ID                          0
Year                        0
Date                        0
Primary Type                0
Arrest                      0
Domestic                    0
District                    9
Ward                    15007
IUCR                        0
X Coordinate            14741
Y Coordinate            14741
Latitude                14741
Longitude               14741
Location                14741
Location Description     4174
dtype: int64
In [81]:
nviolent_crimes_clean = nviolent_crimes.dropna()
nviolent_crimes_clean.isnull().sum().sum()
Out[81]:
0

Analysing all Non-Violent Crimes per district

In [185]:
nviolent_crimes_clean = nviolent_crimes_clean.loc[(nviolent_crimes_clean['X Coordinate']!=0)]

sns.lmplot(x='X Coordinate',
           y='Y Coordinate',
           data=nviolent_crimes_clean,
           fit_reg=False, 
           hue="District", 
           palette='Dark2',
           height=12,
           ci=2,
           scatter_kws={"marker": "D", 
                        "s": 10}) 
ax = plt.gca()
ax.set_title("All Non-Violent Crimes (2001-present) per District")
Out[185]:
Text(0.5, 1, 'All Non-Violent Crimes (2001-present) per District')

Doing some date time processing

In [186]:
nviolent_crimes_clean['Date'] = pd.to_datetime(nviolent_crimes_clean.Date) 
nviolent_crimes_clean['date'] = [d.date() for d in nviolent_crimes_clean['Date']]
nviolent_crimes_clean['time'] = [d.time() for d in nviolent_crimes_clean['Date']]

nviolent_crimes_clean['time'] = nviolent_crimes_clean['time'].astype(str)
empty_list = []
for timestr in nviolent_crimes_clean['time'].tolist():
    ftr = [3600,60,1]
    var = sum([a*b for a,b in zip(ftr, map(int,timestr.split(':')))])
    empty_list.append(var)
    
nviolent_crimes_clean['seconds'] = empty_list
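The per-row loop above works, but pandas can compute seconds-past-midnight directly from the datetime accessor. A minimal vectorised sketch on a toy frame (column names chosen to match the notebook's):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2018-12-31 23:59:00',
                                           '2018-01-01 00:00:30'])})
# seconds since midnight, computed without any Python-level loop
df['seconds'] = (df['Date'].dt.hour * 3600
                 + df['Date'].dt.minute * 60
                 + df['Date'].dt.second)
```

On the full dataset this replaces the string split-and-zip loop with a single expression.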

Analysing the Non-Violent Crimes yearly

In [187]:
nviolent_crimes_clean['Year'].value_counts().plot(kind='bar')
plt.title('Analysis of crime yearly')
plt.xlabel('Year')
plt.ylabel('Non-Violent Crimes')
plt.show()

The figure above shows that the year with the most non-violent crimes was 2003 and the year with the fewest was 2015.

Analysis of Arrests in Non-Violent Crimes

In [188]:
nviolent_crimes_clean['Arrest'].value_counts().plot(kind='bar')
plt.title('Arrests')
plt.xlabel('Arrests')
plt.ylabel('Non-Violent Crimes')
plt.show()

The figure above shows that relatively few non-violent crime cases led to an arrest.

Implementing k-means

In [189]:
sub_data_nviolent = nviolent_crimes_clean[['Ward', 'IUCR', 'District']]
sub_data_nviolent = sub_data_nviolent.apply(lambda x:x.fillna(x.value_counts().index[0]))
sub_data_nviolent['IUCR'] = sub_data_nviolent.IUCR.str.extract('(\d+)', expand=True).astype(int)
sub_data_nviolent.head()
Out[189]:
Ward IUCR District
188585 44.0 1150 19.0
188595 42.0 2250 18.0
188606 41.0 1150 16.0
188657 33.0 2220 17.0
188662 20.0 502 3.0
In [190]:
sub_data_nviolent['IUCR'] = (sub_data_nviolent['IUCR'] - sub_data_nviolent['IUCR'].min())/(sub_data_nviolent['IUCR'].max()-sub_data_nviolent['IUCR'].min())
sub_data_nviolent['Ward'] = (sub_data_nviolent['Ward'] - sub_data_nviolent['Ward'].min())/(sub_data_nviolent['Ward'].max()-sub_data_nviolent['Ward'].min())
sub_data_nviolent['District'] = (sub_data_nviolent['District'] - sub_data_nviolent['District'].min())/(sub_data_nviolent['District'].max()-sub_data_nviolent['District'].min())
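The three lines above repeat the same min-max formula once per column. A small helper (hypothetical name `min_max`, not part of the notebook) keeps the normalisation in one place:

```python
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Scale a numeric Series linearly onto the [0, 1] range."""
    return (s - s.min()) / (s.max() - s.min())

df = pd.DataFrame({'Ward': [10.0, 20.0, 50.0]})
df['Ward'] = min_max(df['Ward'])
```

With such a helper the notebook's cells reduce to one call per column, e.g. `sub_data_nviolent['IUCR'] = min_max(sub_data_nviolent['IUCR'])`.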
In [191]:
from sklearn.cluster import KMeans   # KMeans is not imported earlier in the notebook
import matplotlib.pylab as pl

N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
score = [kmeans[i].fit(sub_data_nviolent).score(sub_data_nviolent) for i in range(len(kmeans))]
pl.plot(N, score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
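`KMeans.score` returns the negative inertia (within-cluster sum of squared distances to each cluster's centroid), which is why the elbow curve rises toward 0 as k grows. The quantity itself can be sketched in plain NumPy, here on hand-picked toy points and centroids (purely illustrative):

```python
import numpy as np

def inertia(X, centroids, labels):
    """Within-cluster sum of squared distances: the objective KMeans minimises."""
    return sum(np.sum((X[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

X = np.array([[0.0], [1.0], [10.0], [11.0]])
centroids = np.array([[0.5], [10.5]])
labels = np.array([0, 0, 1, 1])
# each point lies 0.5 from its centroid: 4 * 0.25 = 1.0
```

The "elbow" is the k beyond which this value stops dropping sharply.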
In [192]:
km = KMeans(n_clusters=4)
km.fit(sub_data_nviolent)
y = km.predict(sub_data_nviolent)
labels = km.labels_
sub_data_nviolent['Clusters'] = y
In [193]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_nviolent['Ward'])
y = np.array(sub_data_nviolent['IUCR'])
z = np.array(sub_data_nviolent['District'])

ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data_nviolent["Clusters"], s=60, cmap="winter")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
In [194]:
nviolent_crimes_clean['Normalized_time'] = (nviolent_crimes_clean['seconds'] - nviolent_crimes_clean['seconds'].min())/(nviolent_crimes_clean['seconds'].max()-nviolent_crimes_clean['seconds'].min())
In [195]:
sub_data_nviolent1 = nviolent_crimes_clean[['IUCR', 'Normalized_time', 'District']]
sub_data_nviolent1['IUCR'] = sub_data_nviolent1.IUCR.str.extract('(\d+)', expand=True).astype(int)
sub_data_nviolent1['IUCR'] = (sub_data_nviolent1['IUCR'] - sub_data_nviolent1['IUCR'].min())/(sub_data_nviolent1['IUCR'].max()-sub_data_nviolent1['IUCR'].min())
sub_data_nviolent1['District'] = (sub_data_nviolent1['District'] - sub_data_nviolent1['District'].min())/(sub_data_nviolent1['District'].max()-sub_data_nviolent1['District'].min())
sub_data_nviolent1.head()
Out[195]:
IUCR Normalized_time District
188585 0.145860 0.979189 0.750000
188595 0.381810 0.968772 0.708333
188606 0.145860 0.958356 0.625000
188657 0.375375 0.907660 0.666667
188662 0.006864 0.898632 0.083333
In [196]:
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
score = [kmeans[i].fit(sub_data_nviolent1).score(sub_data_nviolent1) for i in range(len(kmeans))]
pl.plot(N,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
In [197]:
km = KMeans(n_clusters=4)
km.fit(sub_data_nviolent1)
y = km.predict(sub_data_nviolent1)
labels = km.labels_
sub_data_nviolent1['Clusters'] = y
sub_data_nviolent1.head()
Out[197]:
IUCR Normalized_time District Clusters
188585 0.145860 0.979189 0.750000 3
188595 0.381810 0.968772 0.708333 3
188606 0.145860 0.958356 0.625000 3
188657 0.375375 0.907660 0.666667 3
188662 0.006864 0.898632 0.083333 1
In [198]:
# Plotting the results of 4 clusters
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_nviolent1['Normalized_time'])
y = np.array(sub_data_nviolent1['IUCR'])
z = np.array(sub_data_nviolent1['District'])

ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')

ax.scatter(x,y,z, marker="o", c = sub_data_nviolent1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()

Standardizing datetime

In [82]:
from datetime import datetime
nviolent_crimes_clean['Date'] = pd.to_datetime(nviolent_crimes_clean.Date,format='%m/%d/%Y %I:%M:%S %p') 
crime_data['Date']=  pd.to_datetime(crime_data.Date,format='%m/%d/%Y %I:%M:%S %p')
In [83]:
for i in (nviolent_crimes_clean,crime_data):
    i['year']=i.Date.dt.year 
    i['month']=i.Date.dt.month 
    i['day']=i.Date.dt.day
    i['Hour']=i.Date.dt.hour
In [84]:
hour_by_type = nviolent_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='Hour', aggfunc=np.size).fillna(0)
hour_by_district = nviolent_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='District', aggfunc=np.size).fillna(0)
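Each pivot table counts incidents per (Primary Type, hour) or (Primary Type, District) cell, with absent combinations filled as 0. The same call on a toy frame (column names mirror the notebook's; `aggfunc='count'` counts non-null IDs, equivalent here to `np.size`):

```python
import pandas as pd

toy = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Primary Type': ['GAMBLING', 'GAMBLING', 'STALKING', 'GAMBLING'],
                    'District': [1.0, 1.0, 2.0, 2.0]})
# rows = crime type, columns = district, cells = incident counts
counts = toy.pivot_table(values='ID', index='Primary Type',
                         columns='District', aggfunc='count').fillna(0)
```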

Implementing Agglomerative clustering

In [85]:
from sklearn.cluster import AgglomerativeClustering as AC

def scale_df(df,axis=0):
    return (df - df.mean(axis=axis)) / df.std(axis=axis)


def plot_hmap(df, ix=None, cmap='PuRd'):
    if ix is None:
        ix = np.arange(df.shape[0])
    plt.imshow(df.iloc[ix,:], cmap=cmap)
    plt.colorbar(fraction=0.03)
    plt.yticks(np.arange(df.shape[0]), df.index[ix])
    plt.xticks(np.arange(df.shape[1]))
    plt.grid(False)
    plt.show()
    
def scale_and_plot(df, ix = None):
    df_marginal_scaled = scale_df(df.T).T
    if ix is None:
        ix = AC(4).fit(df_marginal_scaled).labels_.argsort()
    cap = np.min([np.max(df_marginal_scaled.values), np.abs(np.min(df_marginal_scaled.values))])
    df_marginal_scaled = np.clip(df_marginal_scaled, -1*cap, cap)
    plot_hmap(df_marginal_scaled, ix=ix)
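Note that `scale_df(df.T).T` z-scores each *row*, so the heatmap compares a crime type against its own hourly profile rather than raw counts; rare and common crimes end up on the same scale. A quick check on a toy frame (values chosen arbitrarily):

```python
import pandas as pd

def scale_df(df, axis=0):
    # column-wise z-score by default; applied to df.T this scales rows
    return (df - df.mean(axis=axis)) / df.std(axis=axis)

toy = pd.DataFrame({'h0': [1.0, 100.0], 'h1': [3.0, 300.0]},
                   index=['GAMBLING', 'STALKING'])
scaled = scale_df(toy.T).T
# both rows end up identical despite very different magnitudes
```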
In [42]:
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_type)

From the figure we can answer the first question (What time of the day is the most patrolling required?) with regard to non-violent crimes:

Non-violent crimes are most likely to occur between 10 am and the early hours of the morning, up to about 1 am.

Next we cluster with Primary Type and District to determine which crime is more likely to occur in which district.

In [86]:
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_district)

The figure above shows:

The safer districts with respect to non-violent crimes are 1, 2, 14-22, and 24.


CONCLUSION

HOW DOES THE ANALYSIS HELP LAW ENFORCEMENT?
